perf: Improve performance of split_part#19570
Conversation
split_part
| .try_for_each(|((string, delimiter), n)| -> Result<(), DataFusionError> { | ||
| match (string, delimiter, n) { | ||
| (Some(string), Some(delimiter), Some(n)) => { | ||
| let split_string: Vec<&str> = string.split(delimiter).collect(); |
There was a problem hiding this comment.
This was allocating strings for all parts even if only some parts were needed
|
46x faster 👍 |
comphead
left a comment
There was a problem hiding this comment.
Thanks @andygrove the early return makes much more sense than eagerly calculating all the parts
| std::cmp::Ordering::Greater => { | ||
| // Positive index: use nth() to avoid collecting all parts | ||
| // This stops iteration as soon as we find the nth element | ||
| string.split(delimiter).nth((n - 1) as usize) |
There was a problem hiding this comment.
Are 32-bit systems supported ?
n is Int64, so it is possible that this cast may lead to a truncation or even a crash in debug build
There was a problem hiding this comment.
Good catch, thanks. I changed to use try_into with appropriate error handling
| std::cmp::Ordering::Less => { | ||
| // Negative index: use rsplit().nth() to efficiently get from the end | ||
| // rsplit iterates in reverse, so -1 means first from rsplit (index 0) | ||
| string.rsplit(delimiter).nth((-n - 1) as usize) |
There was a problem hiding this comment.
another corner case: -n will fail for i64::MIN
There was a problem hiding this comment.
Good catch, thanks. I changed to use try_into with appropriate error handling
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
I ran microbenchmarks comparing DataFusion with DuckDB for string functions (see apache/datafusion-benchmarks#26) and noticed that DF was very slow for
split_part.This PR fixes some obvious performance issues. Speedups are:
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?